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Abstract 



In many areas of machine learning, it becomes necessary to find the eigenvector 
decompositions of large matrices. We discuss two methods for reducing the computa- 
tional burden of spectral decompositions: the more venerable Nystrom extension and a 
I— I newly introduced algorithm based on random projections. Previous work has centered 

H— I on the ability to reconstruct the original matrix. We argue that a more interesting and 

_^ relevant comparison is their relative performance in clustering and classification tasks 

• using the approximate eigenvectors as features. We demonstrate that performance is 

C^ task specific and depends on the rank of the approximation. 



1 Introduction 



^^ Spectral Connectivity Analysis (SCA) is a burgeoning area in statistical machine learn- 

■^ ing. Beginning with principal components analysis (PCA) and Fisher discriminant analysis 

_j. and continuing with locally linear embeddings [12], SCA techniques for discovering low- 

I • dimensional metric embeddings of the data have long been a part of data analysis. More- 

(^ over, newer techniques, such as Laplacian eigenmaps [2] and Hessian maps \3\, similarly seek 

^~~^ to elucidate the underlying geometry of the data by analyzing approximations to certain 

. . operators and their respective eigenfunctions. 

_^ Though the techniques are different in many respects, they share one major character- 

s' istic: the necessity for the computation of a postive definite kernel and, more importantly, 

H its spectrum. Unfortunately, a well known speed limit exists. For a matrix W G M"^", the 

eigenvalue decomposition can be computed no faster than O(n^). On its face, this limit 
constrains the applicability of SCA techniques to moderately sized problems. 

Fortunately, there exist methods in numerical linear algebra for the approximate com- 
putation of the spectrum of a matrix. Two such methods are the Nystrom extension |1^ 
and another algorithm [7] we have dubbed 'Gaussian projection' for reasons that will be- 
come clear. Each seeks to compute the eigenvectors of a smaller matrix and then embed 
the remaining structure of the original matrix onto these eigenvectors. These methods are 
widely used in the applied SCA community e.g. ^ i5j. However, the choice as to the ap- 
proximation method has remained a subjective one, largely based on the subfield of the 
respective researchers. 

The goal of this paper is to provide a careful, comprehensive comparison of the Nystrom 
extension and Gaussian projection methods for approximating the spectrum of the large 



matrices occuring in SCA techniques. As SCA is such a vast field, we concentrate on one 
specific technique, known as diffusion map from the graph Laplacian. In this technique, 
it is necessary to compute the spectrum of the Laplacian matrix L. For a dataset with 
n observations, the matrix L is n by n, and hence the exact numerical spectrum quickly 
becomes impossible to compute. 

Previous work on a principled comparison of the Nystrom extension and Gaussian pro- 
jection methods have left open some important questions. For instance, [U [13] both pro- 
vide contrasts that, while interesting, are limited in that they only consider approximation 
methods based on column sampling. Furthermore, performance evaluations in the literature 
focus on reconstructing the matrix W. However, in the machine learning community, this 
is rarely the appropriate metric. Generally, some or all of the eigenvectors are used as an 
input to standard classification, regression, or clustering algorithms. Hence, there is a need 
for comparing the effect of these methods on the accuracy of learning machines. 

We find that for these sorts of applications, neither Nystrom nor Gaussian projection 
achieves uniformly better results. For similar computational costs, both methods perform 
well at manifold reconstruction. However, Gaussian projection performs relatively poorly in 
a related clustering task. Yet, it also outperforms Nystrom on a standard classification task. 
We also find that the choice of tuning parameters can be easier for one method relative to 
the other depending on the application. 

The remainder of this paper is structured as follows. In ^ we give an overview of 
our particular choice of SCA technique — the diffusion map from the graph Laplacian — 
as well as introduce both the Nystrom and the Gaussian projection methods. In ^ we 
present our empirical results on three tasks: low dimensional manifold recovery, clustering, 
and classification. Lastly, ^ details areas in need of further investigation and provides a 
synthesis of our findings about the strengths of each approximation method. 

2 Diffusion maps and spectral approximations 

2.1 Diffusion map with the graph Laplacian 

The technique in SCA that we consider is commonly referred to as a diffusion map from 
the graph Laplacian. Roughly, the idea is to construct an adjacency graph on a given data 
set and then find a parameterization for the data which minimizes an objective function 
that preserves locality. We give a brief overview here. See |1U| for a more comprehensive 
treatment. 

Specifically, suppose we have n observations, xi, . . . , Xn- Define a graph G = iV, E, W) 
on the data such that V = {xi,...,x„} are the observations, E is the set of connections 
between pairs of observations, and VF is a weight matrix associated with every element in 
E, corresponding to the strength of the edge. We make the common choice (e.g. |lll §2.2.1]) 
of defining 

Wij = exp{-\\xi- XjW'^/e} (1) 

for all 1 < i,j < n such that (i, j) G E. Further, we define a normalized version of the 
matrix W. This can be done either symmetrically, as 

W := D-^/^WD-'^/'^, (2) 

or asymmetrically as 

W:=D-^W, (3) 



where D := diag(^"'^^ ^ij^ • • • i Yll=i ^nj)- In either case, the graph Laplacian is defined 
to be L := / — W. Here, / is the n by n identity matrix. 

One can show ([HI §8.1]) that the optimal p dimensional embedding is given by the 
difi^usion map onto eigenvectors two through p + 1 of l^j^ That is, find an orthonormal 
matrix U and diagonal matrix S such that W = UT,U and retain the second through 
{p+iy^ column of U as the features. Note that we have supressed any mention of eigenvalues 
in the diffusion map as they only represent rescaling in our applications and hence are 
irrelevant to our conclusions. Also, L and W have the same eigenvectors. 

In most machine learning scenarios, this map is created as the first step for more tra- 
ditional applications. It can be used as a design (also known as a feature) matrix in clas- 
sification or regression. Also, it can be used as coordinates of the data for unsupervised 
(clustering) techniques. 

The crucial aspect is the need to compute the eigenvectors of the matrix W. We give 



an overview of the Nystrom method in ^2.2 and Gaussian projection method in ^2.3 for 
approximating these eigenvectors. 

2.2 Nystrom 

As mentioned in ^ for most cases of interest in machine learning, n is large and hence the 
computation of the eigenvectors of the matrix W is computationally infeasible. However, 
the Nystrom extension gives a method for computing the eigenvectors and eigenvalues of 
a smaller matrix, formed by column subsampling, and 'extending' them to the remaining 
columns. 

The Nystrom method for approximating the eigenvectors of a matrix comes from a much 
older technique for finding a numerical solution of integral equations. For our purposes, the 
Nystrom method finds eigenvectors of a reduced matrix that approximates the action of WlQ 
Specifically, choose an integer m < n and an associated index subset M C {1, . . . , n} =: N 
such that \M\ = m to form an m x m matrix VF'"*) := Wm,m- Here our notation mimics 
common coding syntax. Subsetting a matrix with a set of indices indicates the retention 
of those rows or columns. Then, we find a diagonal matrix A*^*"^ and orthonormal matrix 

Jjim) g^pj^ ^j^j^i- 

These eigenvectors are then extended to an approximation of the matrix [7 by a simple 
formula. See Algorithm [T] for details and an outline of the procedure. 

Of course, choosing the subset M is important. By far the most common technique 
is to choose M by sampling m times without replacement from A^. We refer to this as 
the 'uniform Nystrom' method. Recently, however, there has been work on making more 
informed and data-dependent choices. The first work in this area we are aware of can 
be found in f^. More recently, [1] proposed sampling from a distribution formed by all 
(^) determinants of the possible retained matrices. This is of course not feasible, and [1] 
provides some approximations. We use the scheme where instead of a uniform draw from 
A^, we draw from A^ in proportion to the size of the diagonal elements in W . When used, 
this method is referred to as the 'weighted Nystrom.' 



'^The first eigenvalue and associated eigenvector correspond to the trivial solution and are discarded. 
^We consider the numerical spectral decomposition of the full matrix W to be the 'exact' or 'true' 
decomposition and ignore the effects of numerical error in our nomenclature. 



Algorithm 1 Nystrom approximation based on subsampling 

Given an n x n matrix W and integer m < n, compute an approximation V^y^ to the 

eigenvectors U of W. 

1: Compute C/^™-) and A(™-)via equation Q. 

2: Form 

Ai = — A "i = \/ h^^NM'^i ' 

777. V "■ A^ -^ 

i 

where X^ and u^"* are the i diagonal entry and i column of A^*"^ and U^"^', 
respectively. 
3: Return C/"s^^ = [Gi,...,G^] 



2.3 Gaussian projection 

An alternate to the Nystrom method is a very interesting new algorithm introduced in [7\. 
This new algorithm differs from other approximation techniques in that it produces, for any 
matrix A, a subspace of col(A) (the column space of A) through the action of A on a random 
set of linearly independent vectors. This is in contrast to randomly selecting columns of 
A to form a subspace of col(74) as in [4J and the Nystrom method. This is important as 
crucial features of col{A) can be missed by simply subsampling columns. 

We include a combination of algorithms 4.1 and 5.3 from [7J in Algorithm ^ for com- 
pleteness. 

2.4 Theoretical comparisons 

For both the Nystrom and the Gaussian projection methods, theoretical results have been 
developed demonstrating their performance at forming a rank m approximation to W. For 
the weighted Nystrom method, [1] finds that 

n 

E\\W-W''y'\\F<\\W-Wm\\ + C^Wi (5) 

where VF"?'* is the approximation to W via the Nystrom method and Wm is the best rank 
r?7 approximation. Likewise, for Gaussian projection , [7j finds that 

\\W-W3p\\f<2(i+^-^\6 (6) 

where Omin is the minimum singular value of Q^ Vl and 5 is a user defined tolerance for how 
much the projection QQ distorts VF, ie: 



However, these results are not generally comparable. To the best of our knowledge no 
lower bounds exist showing limits to the quality of the approximations formed by either 
method. So theoretical comparisons of these bounds are ill advised if not entirely meaning- 
less. Furthermore, it is not clear that reconstruction of the matrix W is beneficial to the 
machine learning tasks that the eigenvectors of W are used to accomplish. SCA methods 



Algorithm 2 Gaussian projection 



Given an n x n matrix W and integer m < n, we compute an orthonormal matrix Q that 
approximates col(VF) and use it to compute an approximation U^^ to the eigenvectors of 
W, U. 



Draw n X m Gaussian random matrix 0,. 

Form Y = Wn. 

Construct Q, an orthonormal matrix sucli that co\{Q) = col(y). 

Form B such that B minimizes jj-BQ^ri — (5'''l^||2, ie. B is the least squares solution. 

Compute the eigenvector decomposition of i3, ie: B = UT,U 

Return U^p = QU. 



often serve as dimensionality reduction tools to create features for classification, cluster- 
ing, or regression methods. This first pass dimension reduction does not obviate the need 
for doing feature selection with the computed approximate eigenvectors. In other words, 
the approximation methods help avoid one curse of dimensionality, in that large amounts 
of data incurrs the cost of a large amount of computations. An open question is: can 
the selection of the tuning parameter m also avoid another curse of dimensionality in that 
over-parameterizing the model leads to an increase in its prediction risk? 

A major difference in the Nystrom method versus the Gaussian projection method is 
that U^P remains an orthonormal matrix, and hence provides an orthogonal basis for col(W^). 
This can be seen by considering the output of Algorithm 2l writing Ui = Uf^- as the i 
column, and observing 



th 



{uf, uf) = {QUn,u QUn,j) = {UN,^, Un,j) = dij (7) 

by the orthonomality of Q and U. This is in contrast to Algorithm [ij that produces U"'^^ 
which is a rotation of Wn^m by the matrix f/'™) . This difference has implications that need 
further exploration. However, it does show that Gaussian projection provides a numerically 
superior set of features with which to do learning. 

3 Experimental results 

In order to evaluate the approximation methods mentioned above, we focus on three related 
tasks: manifold reconstruction, clustering, and classification. In the first two cases, we 
examine simulated manifolds with and without an embedded clustering problem. For the 
classification task, we examine the MNIST handwritten digit database [9j, using diffusion 
maps for feature creation and support vector machines (SVMs) to perform the classification. 

3.1 Manifold recovery 

To investigate the performance of the two approximation methods through the lens of 
manifold recovery, we choose a standard "fishbowl" like object. It is constructed using the 









Figure 1: This figure shows the two simulated manifolds we attempt to recover. The fishbowl 
on the left is a standard exercise in the literature, while the modified halo and ball on the 
right allows for a related clustering exercise 



parametric equations for an ellipsoid: 

u £ [0,2Tr) x = acosusmv 

V £ [djir] y = b sin u sin v 

z = ccosv, 

where (a, b,c) > determine the size of the ellipsoid and d G [0, vr) determines the size of the 
opening. We sample v uniformly over the interval, while n is a regularly spaced sequence 
of 2000 points. 

The dominant computational cost for the approximations are 0{nm'^) for Nystrom and 
0{'n?"m) for Gaussian projection , we choose the parameter m so that they have equal 
dominant cost. That is, if we choose nigp for the Gaussian projection method to be a 
certain value, we set rrinys for the Nystrom method to be runys = Jnmgp. For the Nystrom 
method, we take ninys = 141 while for Gaussian projection, rugp = 10. 

The manifold reconstruction of the ellipsoid is shown in Figure [2] Both methods perform 
reasonably well. The underlying structure is well recovered. For this figure, the tuning 
parameter was set at e = 15 in all three cases. Changing it makes very little difference 
(e € [1, 200] yields very similar figures over repeated randomizations^. This robustness with 
respect to tuning parameter choice is rarely the case as we will see in the next example Q 

3.2 Clustering 

To create a clustering example similar to the manifold recovery problem, we removed the 
bottom fifth of the fishbowl and created a glob in the middle of the resulting halo. See 
Figure [T] for a plot of the shape. We consider the light blue observations in the middle one 



^The magnitude of the tuning parameter depends heavily on the intrinsic separation in the data 
''Another standard manifold recovery example, the "swissroU", is very sensitive to tuning parameter 
choice as well as the choice of the matrix W . 
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Figure 2: Here, the true eigenvectors along with both approximate methods demonstrate the 
manifold recovery capabilities. For equivalent computational time, both methods perform 
reasonably. The true eigenvectors are on the left, the Nystrom method is in the center, 
while the Gaussian projection method is on the right. 



cluster and the outer ring as a second cluster. The result is a three dimensional clustering 
problem for which linear classifiers will fail. However, diffusion methods yield an embedding 
which will be separable even in one dimension via linear methods (Figure pi first column) . 
The next two columns show the Nystrom approximation with m. 



nys 



200 and for Gaussian 

projection with rugp = 20. We see that Nystrom provides a very faithful reproduction of the 
exact eigenvectors. Gaussian projection loses much of the separation present in the other 
two methods. 

It is important to keep in mind the selection of appropriate tuning parameters. Our 
choice in this example is a subjective one, based on visual inspection. In this scenario, tun- 
ing parameter selection becomes very difficult, and hugely important. Small perturbations 
in the tuning parameter lead to poor embeddings which not only yield poor clustering solu- 
tions, but which can completely destroy the structure visible in the data alone. Furthermore, 
tuning parameters must be selected separately for the exact as well as the two approximate 
methods with no guarantee that they will be similar. In our particular parameterization, 
the exact reconstruction was successful for e E [0.05,0.15] while the approximate methods 
required e ~ 0.25. The reconstruction via Gaussian projection shown in Figure |3] displays 
the best separation that we could achieve for this computational complexity. Clearly, the 
Nystrom method performs better in this case, yielding an embedding that remains easily 
separable even in one dimension. 

3.3 Classification 

A common machine learning task is to classify data based on a labelled set of training data. 
When all the data (training and test) are available, using the semi-supervised approach of a 
diffusion map from the graph Laplacian is a very reasonable technique for providing labels 
for the test data. However, this technique requires a very large spectral decomposition as 
it is based on the entire dataset, both training and test. 

We investigate classifying the handwritten digits found in the MNIST dataset and used 
in e.g. p]. See the left panel of Figure El for fifteen randomly selected digits. Using SVM 
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Figure 3: Using the true eigenvectors and with both approximation methods, we attempt 
to perform a clustering task. For equivalent computational time, the Nystrom method 
far outperforms Gaussian projection. The true eigenvectors are on the left, the Nystrom 
method is in the center, while the Gaussian projection method is on the right. 



with the (approximate) eigenvectors of the graph Laplacian as features, we attempt to 
classify a test set using each of the above described approximation techniques. Though the 
digits are rotated and skewed relative to each other, we do not consider deskewing as very 
good classification results have been obtained using SVM without any such preprocessing 
(see for example [8]). We choose the smoothing parameter in the SVM via 10 fold cross- 
validation. Additionally, we choose the bandwidth parameter of the diffusion map, e, by 
minimizing the misclassification error on the test set over a grid of values. 

For small enough datasets, we can compute the true eigenvectors to get an idea of 
the efficiency loss we incurr by using each approximation. If we choose a random dataset 
comprised of ntrain = 4000 test digits and ntcst = 800 training digits, then it is still feasible 
to calculate the true eigenvectors of the matrix W. Table [T] displays the results with 
an approximation parameter oi m = rrinys = fngp = 400 rl We see that the Gaussian 
projection method has the best performance of the three considered methods. Additionally, 
as expected, the weighted Nystrom method outperforms the uniform Nystrom, but only 
slightly. Note that the true eigenvector misclassification rate, 5.5%, is a bit worse than that 
reported in [8] — 1%. We attribute this to using a much smaller training sample for our 
classification procedure. 



""In the manifold recovery and clustering tasks, we are essentially interested only in the first few eigenvec- 
tors of W, so it makes sense to compare the two methods for equivalent computational complexity. However 
in the classification task, we want to create two comparable feature matrices which may then be regularized 
by the classifier. For this reason, it seems more natural to choose rrinys ~ rrigp. 



Method 



# Correct 



'l-test 


800 


True eigenvectors 


756 


Uniform Nystrom 


697 


Weighted Nystrom 


701 


Gaussian projection 


725 



Table 1: ntrain = 4000, m = 400 
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Figure 4: Left: A random sample of fifteen digits from the MNIST dataset. Right: Mis- 
classification rates from running the three approximation methods on a random subset of 
the MNIST digits for m G {50, 150, 450, 1350, 4050} with nt^ain = 15000 and ntest = 3000 
(see text for more details). The weighted and uniform Nystrom methods are very similar. 
The Gaussian projection method results in a worse misclassification rate for smaller m, but 
its performance improves markedly for larger m. 

Additionally, we ran a comparison with ntrain = 15000 and ntest = 3000. Here computing 
the true eigenvectors would begin to be truly time/resource consuming. For each m G 
{50, 150, 450, 1350, 4050} we perform the grid search to minimize test set misclassification 
for each method. We do this a total often times and average the results to somewhat account 
for the randomness of each approximation method. The results appear on the right hand 
side of Figure [4| We see that again the weighted and uniform Nystrom methods produce 
very similar misclassification rates. Interestingly, the Gaussian projection method results 
in a worse misclassification rate for smaller m, but its performance improves markedly for 
larger m. 

4 Discussion 

In this paper, we consider two main methods — the Nystrom extension and another method 
we dub Gaussian projection — for approximating the eigenvectors of the graph Laplacian 
matrix W. As a metric for performance, we are interested in ramifications of these approx- 
imations in practice. That is, how do the approximations affect the efficacy of standard 



machine learning techniques in standard machine learning tasks. 

Specifically, we consider three common applications: manifold reconstruction, clustering, 
and classification. For the last task we additionally investigate a newer version of the 
Nystrom method, one that allows for weighted sampling of the columns of W based on the 
size of the diagonal elements. 

We find that for these sorts of applications, neither Nystrom nor Gaussian projection 
achieves uniformly better results. For similar computational costs, both methods perform 
well at manifold reconstruction, each finding the lower dimensional structure of the fishbowl 
as a deformed plane. However, the Gaussian projection method recovery is somewhat 
distorted relative to the true eigenvectors. In the clustering task, Gaussian projection 
performs relatively poorly. Although the inner sphere is now linearly separable from the 
outer ring, the embedding is much noisier than for the true eigenvectors and the Nystrom 
extension. Lastly, we find that the weighted and uniform Nystrom methods result in nearly 
the same misclassification rates, with perhaps a slight edge to the weighted version. But, 
Gaussian projection outperforms both Nystrom methods as long as m is moderately large: 
about 10% of n. This value of m still provides a substantial savings over computing the 
true eigenvectors. 
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